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Abstract 



We discuss the property of a.e. and in mean convergence of the Kohonen algorithm considered 
as a stochastic process. The various conditions ensuring the a.e. convergence are described and 
the connection with the rate decay of the learning parameter is analyzed. The rate of convergence 
is discussed for different choices of learning parameters. We proof rigorously that the rate of decay 
of the learning parameter which is most used in the applications is a sufficient condition for a.e. 
convergence and we check it numerically. The aim of the paper is also to clarify the state of the 
art on the convergence property of the algorithm in view of the growing number of applications 
of the Kohonen neural networks. We apply our theorem and considerations to the case of genetic 
classification which is a rapidly developing field. 

1 Introduction 

Data clustering ([T]-[5]) is a basic technique in gene expression data analysis since the detection 
of groups of genes that manifest similar expression patterns might give relevant information. 
Therefore it is important to have a good control on the properties of clustering algorithms. The 
Kohonen algorithm ( or Kohonen neural network) ([B]- [5]) is currently used in this field. The 
Kohonen neural networks are different from the other neural networks like back propagation or 
the Hopfield model ([9]- [12] ). The main difference is that there is only a single layer of units ( 
named neurons) and the output of the network is just a vector or a scalar associated with each 
neuron called weight vector. These networks are commonly used for classifying sets of experimental 
data. The weight vector associated with the neuron represents a characteristic vector of a certain 
subset of the data. The set of these subsets constitutes a disjoint partition of the measures. The 
sets of the partition are also called clusters and in the applied science there are many different 
algorithms which construct clusters from a data set. Many of these algorithms have the drawback 
that they depend on arbitrary choice of some parameters and therefore the clustering results 
might be non unique. The main feature of clustering by means of the Kohonen algorithm is 
that it depends only on the choice of a special function, the learning parameter, which has been 
extensively characterized. The process of individuation of the weights is called the learning process 
and the Kohonen algorithm is a special learning process. This algorithm consists in extracting 
at each time step n a number or a vector from the data set and subsequently the nearest weight 
to this data is modified of a quantity proportional to the difference among these two vectors 
multiplied by a parameter. This parameter is the learning parameter, and it must decrease with 
n. The convergence of the learning process strongly depends on the rate of decay of the learning 
parameter and the investigation of this point is one of the main topics of this paper. An important 
characteristic of the Kohonen algorithm is the Self Organization(SO) which can be understood 
as the fact that the sequence of the weights converges to a unique limit independently from the 
chosen sequence of the data presented to the network and from the initial values of the weights. 
In the language of the stochastic processes we can express this fact by observing that the sequence 
of the weights is a stochastic process and SO is equivalent to the a.e. convergence of the learning 
process. This property is rather strong and it is supposed to hold in many applications of the 
Kohonen networks but unfortunately it is not trivial at all. This is especially true for genetic 
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application where the set of clusters (atoms) describes different cell conditions or different genes 
function. In order to have a real biological meaning the classification should be independent on 
the initial conditions of the weights and from the input sequence. So it is worth investigating, 
both theoretically and numerically, the connection among the a.e. convergence and the possible 
choices of the learning parameter 77(77) and the different versions of the Kohonen algorithm. There 
are already many important results on this subject ([13]- [28]). All these results show that there 
is a critical dependence of the a.e. convergence on the probability distribution of the data, on the 
choice of the learning algorithm and on the velocity to approach zero by 77(77) . In this paper we 
generalize the results obtained in the paper of Feng and Tirozzi ([25]) relaxing the condition on 
the convergence of the series J2 n vi 71 ) 2 ( but of course it is assumed that 77(77) — ► Q), so only the 
condition ^ n 77(71) — > 00 remains. The condition ^ n 77(71) 2 < 00 is used in all the other versions 
of the theorems of convergence, but we have verified numerically that it implies a too rapid 
convergence to zero of the learning parameter. So the good decrease rate for 77(77) is to go to zero 
more slowly than 1/77. But our theorem does not exclude the l/n decay rate since it also satisfies the 
condition Y) 77(77) — > 00. Thus this theorem gives a support to the property of a.e. convergence 
for the right decay of 77 but it is uncompleted because we cannot show that the stronger condition 

rj(n) 2 < 00 spoils the a.e. convergence. The previous results also are troublesome because we 
are faced with the fact that a theorem with a defined proof of convergence does not correspond 
to the numerical simulations. The only thing we can say is that at least our version of the 
convergence theorem picks the right decrease property. There is a well known general explanation 
about the right choice of the rapidity of the learning parameter decay which is connected with 
the existence of meta-stable points. In analogy with the Simulated Annealing (SA) we can say 
that the learning parameter corresponds to the temperature and it is a well known fact that a too 
rapid decrease of the temperature in the SA makes the algorithm stop on the local minima of the 
energy function. The unlucky situation is that in the case of the Kohonen algorithm there is no 
such a function. In many proofs of the convergence one can find some functional with a similar 
property but they are not the energy or the Liapunov function. The other important question 
tackled in this paper is about the rate of the convergence of the algorithm: since the condition 

r /( n ) — * 00 can be satisfied by many different 77(77) we compare the different choices analyzing 
the velocity of approach to the limit of the corresponding algorithm. This question is important 
in any case but has special relevance in gene clustering where the data set is the set of expression 
levels of M genes, M being rather large thank to the application of the microarrays technique. The 
meaning of M in our construction is the maximum number of iterations of the learning process. 
Another question considered in this paper is the analysis of the relation of the a.e. convergence 
with the probability distribution of the data and also with the different versions of the Kohonen 
algorithm. We then apply all these results to the problem of clustering and classifying the great 
number of genes revealed in the microarrays experiments. The possibility of applying clustering 
algorithms ( not only the Kohonen algorithm) in genetics appeared with the development of the 
DNA microarrays technology. The micro-array allows to monitor simultaneously the expression 
levels of thousands of genes during important biological processes. Elucidating the patterns hidden 
in the gene expression data is a tremendous opportunity for functional genomic. However, because 
of the large number of genes and complexity of biological networks, it is difficult to interpret the 
resulting mass of data; so the clustering techniques become essential in data mining process for 
identifying interesting distributions and patterns in the underlying data. 

Clustering algorithms have simplified the grouping of genes with similar biological expression. 
Co-expressed genes found in the same cluster suggest functional similarities. Gene clustering also 
becomes the first step to uncover the regulatory elements in transcriptional regulatory networks. 
Co-expressed genes in the same cluster might be involved in the related cellular process and strong 
expression pattern correlation between those genes suggests co-regulation. 

There is a large literature on cluster analysis and genetic one (pQ - [5]); numerous approaches were 
proposed on the basis of different quality criteria and not all the algorithms are well founded. In 
addition the results of the algorithms depend strongly on many arbitrary choices, for example on 
the initial conditions and the value of the threshold. 

The main topic of the last section of this paper is an application of the Kohonen algorithm to 
a concrete problem of gene classification. The aim is to find the genes which are over expressed 
during the treatment of tumor cells of mice using a clustering technique that has the minimum 
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arbitrary choices. 

The analysis made in the first sections of this paper convinced us to use the Kohonen algorithm. 
We compare the results obtained with the Kohonen algorithm to this problem with the ones 
obtained using the PCA (Principal component analysis) and Hierarchical clustering algorithm 
f|30p. This is the first step of a larger work of comparing the results of gene classification obtained 
by means of different algorithms. We think that this work is necessary in order to validate the 
gene clustering. 

Another important issue is the variability of the expression levels of genes obtained by different 
samples which cannot be considered equal. For economic and time reasons it is difficult to have 
more than three biological realizations of the experiment and this is the origin of an error in the 
data. The errors influence the structure of the clusters so it is possible that a gene changes cluster 
if we take into account this error in the analysis. In our work we have included explicitly this 
effect and evaluated its influence on the results. 

The structure of the paper is the following. In Section 2, after a short intuitive introduction, we 
show the algorithm and explain its properties using a precise mathematical formulation, enunciate 
the theorems and give the proofs . In Section 3 we show the results of the numerical simulations. 
In Section 4 we show the applications to the mice data and our results. In Section 5 we give our 
conclusions. 

2 The Kohonen Network 
2.1 An intuitive description 

The Kohonen Network ([6]- [8]) is formed by a single layered neural network. The data are pre- 
sented to the input and the output neurons are organized with a simple neighborhood structure. 
Each neuron is associated with a reference vector ( the weight vector), and each data point is 
"mapped" to the neuron with the "closest" ( in the sense of the Euclidean distance) reference 
vector. In the process of running the algorithm, each data point acts as a training sample which 
directs the movement of the reference vectors towards the value of the data of this sample. The 
vectors associated with neurons, called weights, change during the learning process and tend to 
the characteristic values of the distribution of the input data. The characteristic value of one 
cluster can be intuitively understood as the typical value of the data in the cluster and will be 
defined more precisely in the next subsections. At the end of the process the set of input data 
is partitioned in disjoint sets ( the clusters) and the weight associated with each neuron is the 
characteristic value of the cluster associated with the neuron in one dimensional case, which is 
the case of interest to us. We limit our analysis to this case because the condition of convergence 
of the algorithm is easier to check, the cluster of the partitions are easier to visualize and it is 
not difficult to compare the behavior of the genes in the clusters corresponding to the different 
biological conditions. Each neuron individuates one cluster, the physical or biological entities with 
measure values belonging to the same cluster are considered to be involved in the same cellular 
process. Thus the genes with expressions belonging to the same cluster might be functionally 
related. 

The following points show the main properties which make the Kohonen network useful for clus- 
tering : 

1. Low dimension of the network and its simple structure. 

2. Simple representation of clusters by means of vectors associated with each neuron. 

3. Topology of the input data set is somehow mapped in the topology of the weights of the 
network. 

4. Learning is unsupervised. 

5. Self-organized property 

The points l)-2) are simple to understood and many examples are shown in the Section 3. The 
point 3) means that neighboring neurons have weight vectors not very different from each other. 
The point 4) means that there is no need to have an external constraint to drive the weights 
towards their right values beside the input to the network and that the learning process finds by 



3 



itself the right topology and the right values. This holds only if the learning process with which 
the network is constructed converges a.e. or if the mean values are taken. The self-organization is 
formulated in the current literature referring to some universality of the structure of the network 
for a given data set. It is connected to the point 3) and is also a consequence of a.e. convergence 
or of the convergence of the average of the weights over many different learning processes. 

2.2 Exact definition 

In this subsection we give the definitions using exact mathematical terms. We restrict ourselves 
to the particularly simple one dimensional case which is the most interesting for our applications. 
First we show how the Kohonen network is used for classification and then what is the process of 
its construction. Let I = Ii, ., In be a partition of the interval (0, A) of the possible values of the 
expression levels in the intervals Ii, \J i Ii = (0, A) and Ii f] Ij = 0. 

Suppose that the construction of the Kohonen network has been already done and the Ii are the 
clusters. 

Let oji, i = 1 , N be the weights or the characteristic values of the clusters which will be exactly 
defined below. Then a data £ is said to have the property i if £ e Ii. 
The classification error is 



where /(£) is the density of probability distribution of the input data and T — J2i=i \Ii\,wtth \Ii\ 
is the number of data in the set Ii. 

The partition / = Ii,.,In is optimal if the associated classification error E is minimal. The 
characteristic vectors uii are the values which minimize E. Before giving exact definitions let us 
explain in simple terms the procedure for determining the sets Ii and the associated weights Wj. 
Let x(l), . . . , x(P) be a sequence of values randomly extracted from the data set, distributed with 
the density f(x) and take randomly the initial values wi(0), wjv(0) of the weights. When an 
input pattern x(n), n = 1, .., P, is presented to the network all the differences \x(n) — oJi(n — 1)| 
are computed and the winner neuron is the neuron j with minimal difference |£(n) — uij(n — 1)|. 
The weight of this neuron is changed in a way defined below, or, in some cases, the weights of the 
neighboring neurons arc changed. Then this procedure is repeated with another input pattern 
x(n + l) and with the new weights u>i(n) until the weights u>i(n) converge to some fixed values for 
P large enough. 

In this way we get a random sequence ui(n) = (u)i(n), ...,Lj N (n)) which converges a.e, under suit- 
able conditions on the data set, with respect to the choice of the random sequence of data and 
the random choice of the initial conditions of the weights, n is the number of iterations of the 
procedure. The learning process is the sequence uj(n) and the S.O. (self-organizing property) coin- 
cides in practice with the almost everywhere convergence of ui{n). The learning process converges 
somehow to the optimal partition in the Kohonen algorithm. In fact the algorithm can, with some 
approximation, be viewed as a gradient method applied to the function E: 



In one dimension the Kohonen algorithm in the simplest version of the winner-take-all case is : 

1. Fix N. 

2. Choose randomly at the initial step (n = 0) the ( < i < N). 

3. Extract randomly the data £(1) from the data set. 



Then the global classification error E of the network is 




(2.2.1) 



uj t {n + 1) = LUi(n) + »j(n)[£(n) - Wj(n)] 
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4. Compute the modules 



MO) -£(1)| i = l,...,N 



5. Choose the neuron v such that 



K(0) -£(i)l 



is the minimum distance, v is the winner neuron 



6. Update only the weight of the winner neuron: 



U! v (l) 



^(0) + r?(l)(£(l)-^(0)) 



7. n = n + 1 

One of the basic property of the Kohonen network is that the weights are ordered if the learning 
process converges. 

We remind the definition of the order of a one dimensional configuration: 



the order holds also for the other inequality |w r — w B \ > \u> r — L0 q \. Then: 
The ordering property is: 

If the Kohonen learning algorithm applied to one dimensional configuration of weights converges 
the configuration order itself at a certain step of the process. The same order is preserved at each 
subsequent step of the algorithm 

This property allows to check when the algorithm converges since the final configuration of weights 
must be also ordered and it is a necessary property for a.e. convergence. We also check the remark- 
able property proved by Kohonen ([6]-[8]) that the mean process uj(n), i.e. the process obtained by 
averaging with respect different choices of the sequence x(l), ...x(P) is always converging. But for 
getting the a.e. convergence from the convergence of the mean values additional hypothesis must 
be used and the discussion and the applications of these results to a case of genetic classification 
is the main topic of this paper. 

2.3 General formulation 

We describe now the Kohonen algorithm in more general terms for allowing the treatment of all 
the possible cases. 

The Kohonen network is composed by a single layer of output units Oi, i = 1, ...N each being fully 
connected to a set of inputs £j(n), j = 1, ...,M. An M dimensional weight vector u>i(n) ,uii(n) = 
(ujij(ri),j = 1, ...M) is associated with each neuron, n indicates the n-th step of the algorithm. 
We assume that the inputs £j(n),j — 1, M are independently chosen according to a probability 
distribution f(x). For each input £j(n),j — 1, ...,M we choose one of the output units, called the 
winner. The winner is the output unit with the smallest distance between its weight vector uj v (n) 
and the input 



where Ia is the characteristic function of the event A, i.e, Ia(^) = 1 if a; G A and 1a(x) = if 
x £ A. 

This function selects the event in which the weight of the neuron v is the nearest to the input 
data £(n) and it is necessary for writing the learning process in a compact form. The generalized 
Kohonen algorithm updates the weights of the neurons belonging to a given neighbor of the winner 
neuron: 



\r- s\ <\r — q\-& \iv r - v s \ < \w T - w q \, Vr,s,g e {1,2..., N} 




I(u} v (ri),£(n + 1)) - I{|| Wv (n)-C(n+l)||<|| Wj (n)-C(n+l)||, j^v} 



+ 1) = u>ij(n) + r](n)T(i,v)l(u) v (n),^(n + 1)) (fj(r&+ 1) 



UJij(n)) 



(2.3.1) 
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i = 1, N e j = 1, ...M or in vector form 



uji(n + l) = uji{n) + r](n)T(i,v)I(uj v (n),i(n + 1)) (£(n + 1) - w;(n)) 



(2.3.2) 



where r](n) is the positive learning parameter 77(0) < 1, rj(n) > ?7(n + 1) and T(i,v) is a non 
increasing function of \i — v\, the distance among the neuron i and v on the lattice where the 
neurons of the network are located. 

This version is more general than the winner-take-all rule explained before. Not only the weight of 
the winner neuron is updated but also the weights of the neurons which belong to a neighborhood 
defined by the function T(i,v). We discuss various choices of the function T(i,v) below. After 
the learning procedure is finished, the set of input vector will be partitioned into non overlapping 
clusters. This means that a new signal £(n + 1) is classified as the pattern i if and only if 

|k-£(n + l)|| < ||^-£(n+l)||, j±i 

Let us introduce the definition of Voronoi tessellation Ii(y) = (H(y)i, i = 1, ...N) associated with 
a family vectors yi, ...,j/jv € f2 , f2 being a given compact of M. M . 

Definition 2.1. For a given compact subset £1 E R M , the Voronoi tessellation H(y) = ( H(y)i, 
i = 1, ..N) associated with a family of vectors y\, yjy is the partition of Q: 



n (2/)* = {x, hi - x\\ < \\y-j - x\\,j ^i} i = l, ...N 



(2.3.3) 



Therefore a Voronoi cell of an unit i contains those vectors which are closer to the weight 
LOi than to the other weights. The characteristic values mentioned before are the limit of the 
sequences of the vectors LOi(n) defined by the above algorithm and are weights of the Voronoi 
tessellation obtained in the limit. 

A crucial point of the algorithm is the choice of the neighborhood function T(i,v) of the winner 
neuron. It determines the region around the winner neuron where there are the neurons which 
update their weight vectors together with the winner neuron. A convenient choice is the finite 
region of activation of the winner neuron, i.e. r = A where : 



A(i,v) = { 



lif\i- 
othe 



v\ < s 



where |.| represents the distance between the neuron i and the winner neuron v. 
If s = 1 and the neural network is one dimensional, the region of activation includes the winner 
and the two nearest units (figure [TJ ; if the network is designed in two dimension then the range 
includes the eight nearest neighbor units near the winner . 

If A(i,v) = Si V the algorithm coincides with the winner takes all algorithm we described in the 
previous section. 



Neuron v 



Figure 1: Neighborhood function A(i, v) 



Another choice, often used in the applied research, for the neighborhood function T is a gaussian 



function h denning a region around the winner neuron with amplitude decreasing with the number 
of iterations of the learning process: 

h(M,n) = exp(-^4-) (2.3.4) 
o\ny 

where u(n) is a decreasing function. A commonly used choice is: 

a(n) = (7i(— 

where n max is maximum number of iterations of the algorithm and erf, Oi are respectively the 
final and initial value of the parameter er(figure pj. 




Neuron v 

Figure 2: Neighborhood function h(i,v,n) 



2.3.1 The theorem of convergence 

The first result about the algorithm convergence was found by Kohonen ([7]). He concentrated 
on one-dimensional mapping and demonstrated that the weights converge in mean to the limit 
values. Although the result is enunciated as a.e. convergence in this paper only the convergence 
in mean is proven. The convergence in mean is obtained by making the average of the weights on 
many different sequences of patterns x(n). The ordering of the weights has been proved in ([7j) 
for the winner-take-all process. 

In the paper of Erwin et al. (|16|.[TT]) there is a proof of ordering for one-dimensional case which 
holds for any neighborhood function which is monotonically decreasing with distance and in the 
case of non uniformly distributed input. 

Many other authors ([13], [19], [21], [22], [23], [25], [28],) investigated the convergence properties 
of the Kohonen algorithm in one and more dimensions, someone by viewing the weight values as 
states of a Markov process, others using the ordinary differential equations for the mean values 
of the network. But the main results have been limited to one dimensional map where the prop- 
erty of order is valid and under certain conditions on rj(n), the expectation of the values weights 
converges to a unique value. The existence and uniqueness of the minimum is ensured by the 
existence of a unique minimum of some functional, but the existence of the minimum is difficult 
to check for non uniform distribution of the input values especially in the multidimensional case. 
In more than one dimension, despite the robustness of the algorithm which has been used suc- 
cessfully in many different application area, there is still no proof of a necessary and sufficient 
condition for the convergence of the algorithm. There are proofs of sufficient conditions and only 
a few for the multi dimensional case, see for example Feng and Tirozzi ( 25J), Lin and Si (|19j). 
Sadeghi ([23 ). Lin and Si have shown that the distribution of the weight values converge to a 
stationary state introducing and studying the same objective function proposed by Ritter and 
Schulten 21J. In the paper of Feng and Tirozzi the convergence problem of the Kohonen feature 
mapping algorithm has been proven by using stochastic approximation theory. But in all these 
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papers the rate of decrease of the learning parameter is too fast and so these theorems are con- 
tradicted by numerical results. Only in the paper of Feng and Tirozzi it is mentioned explicitly 
that the rate of decrease of the learning parameter of these theorems is too fast and there is a 
proposal for a slower decay. In this paper we proof that there is a.e. convergence if the rate is the 
one of numerical simulations, but we can show only the sufficiency of this condition. Moreover a 
condition of the existence of a global attractive minimum is always required. 
If there is no global minimum there is no a.e. convergence and the algorithm remains stacked, 
as in the case of simulated annealing, in some points which might not even be ordered and then 
the convergence is obtained only by averaging with respect to the sequences of learning examples, 
which is happening for the genetic data in general. Now we start to expose the definitions and 
concepts used in our proof. We first explain the definitions used in the book of Nevel'son and 
Has'minski ( 20J) which will be used in the proof of our main theorem. Let £(fc), k < n be the 
sequence of random patterns presented to the network during the learning and T n the a-algebra 
generated by them, E(C| Fn) is the conditional expectation of the random variable C with respect 
to the sigma algebra T n . 

Our aim is to prove that the process of the weights uj(n) converges to a certain set B C l )VxM 
(the limit set), so we need the definitions summarized in the following list: 



Definition 2.2. . 



1. A distance between vectors y £ Wi M and cji £ K p(^i, y), with i = 1, ...N. 

2. A distance from the point uj(n) and the set B: p(uj(n), B) = inf y ^Bp(u>,y). 

3. An e neighborhood of B, U e (B) = {lu : p(uj,B) < e}. 

4- The complementary set of this neighborhood V e {B) — R NxM \U 5 (B). 

5. The intersection of the complementary set with a sphere of radius R: V e ^(B) = V e (B) PI 
{w(n) : \uj(n)\ < R}. 

6. A positive definite Lyapunov function W(n,u>), u> £ M. NxM . 

7. An operator LW{n,ui) — E (W(n + l,u>(n + 1)) — W{n,uj(n))\!F n ) defining a kind of first 
difference of the Lyapunov function by means of conditional expectation. 

8. A negative function g(n, u>) used for bounding the increments of the Lyapunov function 
W{n, u>) such that 

inf [-g(n,u)}>0 (2.3.5) 

n>Q,u£V<,, B (B) 

for all R > e > and some Q = Q(e, R). 

Let us briefly comment these definitions. 

1) As we have seen before a;, and y are M. M vectors and since we have to compare their difference 
it is necessary to introduce the module of these vectors. 

2) In the general case the limit point might be a set so the distance of a point from a set must be 
defined. 

3) As is usual in the theory of limits one needs to find a neighborhood U e (B) of the limit points 
B which differ from B by a small portion. 

4) It is also necessary to introduce the complementary set V £ (B) of this neighborhood. 

5) For doing the estimates of asymptotic limits of series or functions it is useful to introduce a 
spherical subset V(B) e R of V £ (B). 

6) In analogy with the theory of stability in order to show the convergence of a trajectory of a 
dynamical system it is useful to have a Liapunov function and compute its increments. In this 
case we do not have an usual dynamical system but a stochastic sequence. 

7) The consequence of this fact is that the derivative ( or increments) of the Liapunov function 
is not the usual one but is a conditional expectation. The convergence holds for the sequence 
W(n, uj) as a consequence of Doob's theorem of convergence for martingales but it is difficult to 
use the concepts of local and global minimum in this situation. In our theorem the concept of 
global minimum in the classical sense is introduced but for the bounding function g(n,u>). 

We will use this theorem of ([20]) in our proof of the a.e. convergence: 
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Theorem 2.1. Suppose that there exist a function W(n,u>) > such that: 

LW(n,uj) < 77(77)3(77, w) (2.3.6) 



where n > 0, uo £ R NxM and g the function which satisfies the above statement (2.3.5) 
Moreover let : 

+oo 

^2 V{n) = +oo , 77(71) > (2.3.7) 

n=l 

and: 

inf n >oW(n,id) — ► +oo \uj\ — > +oo (2.3.8) 
Then, considering the previous definitions: 

P{sup \u(n)\ = R < +00} = 1 (2.3.9) 

n 

+00 

P{X>(«)h?(«M«))] <+^} = 1 (2-3.10) 

M = 

Pjliminf p(uj(n),B) = 0} = 1 (2.3.11) 

n — >+oo 

We can say that a random process u)(n) converges a.e. to a limit set B if it is possible to 
find a Liapunov function W(n, oj(n)) of the process such that the conditional expectation of its 
increments are less than a function g(n,uS) multiplying the learning parameter 77(71), then uo{n) 
converges to B, if the function g is negative in a certain spherical neighborhood of B and if the 
learning parameter decreases not so quickly. So it is enough that 

lim ?7(n) = 

n — >+oo 

in order that the a.e. convergence of the weights holds. The interesting fact is that the stronger 
condition 

+00 

77(n) 2 < +00. 

n=l 

is not introduced. The result of this theorem is neat because the condition 

lim 77(71.) = 

n — >+oo 

is the one used in the numerical applications. In Section 3 we will give many examples of "good" 
and "bad" decay of 77(71). The choice of 77(71) is important also for the speed of convergence of the 
process. Another key role for the a.e. convergence is the form of the probability distribution of 
the data as it will be clear from the theorem we present below. 

In order to understand it we need other definitions. Let us introduce a function g which is the 
leading term of the super martingale difference given in the proof of theorem 2.2 It is a particular 
realization of the function g used in the theorem of Nevel'son and Has'minskii: 

N 

ff(2/i,2/2, -,2/jv;w 1 ,w 2 , ...wjv) = zZiVi ~ ^i) • Yj( / A(k,i)(x - y i )f{x)dx). (2.3.12) 

where w,-(n) = (wy(n),j = l,...M,i = 1,...,N) and 77,(71) = (j%(n),j = l,...M,i = l,...,iV) € 
RMxAf / is the density of the probability distribution of the dat a with support on a compact set 
Q of E M , n(y) is the Voronoi tessellation associated with 77 (see (2.3.3 1). (yi — uii) ■ (x — yCj is the 
M-dimensional scalar product. 

We define also: 

= {the set of all Voronoi tessellations associated with {ui\{n), . . . , w/v (n) } for all n\ 

For 77 £ K M we use the convention that 77 £ implies that there exists a Voronoi tessellation H(y) 
such that {H(y)i, i — 1...N} £ Q. Finally we can enunciate our theorem: 



9 



Theorem 2.2. Let the vectors u(n) £ u MxJV f, e updated by the Kononen algorithm (2.3.2) 

Wi(n+1) =Wi(n) + rj(n)A(i,v)I(uj v (n),t(n + l)) ■ (£(n+ 1) - Ui(n)) 
if there exists a unique point u> = (uii, ...u)at) £ M. MxN such that for each y — (2/1,2/2, • ••,J/iv)- 

g(yi,V2, ■-, Vn\ &i, •■•wat) < Vy S 6 (2.3.13) 
where the equality holds if and only if yi — i = 1, ...N and : 

+oo 



V 77(71) = +oo lim 77(71) = (2.3.14) 

^ — ' n — >+oo 



n=l 



£/ien we almost everywhere have: 

lim LUi(n) — H>i i = l,...,N 

n — ^+oo 

Remark 1. This theorem is interesting because the rate of decay of f](n) is the one used in 
simulations but it is still not enough because the full proposition should exclude the decays which 
are not used in the simulations i.e. the ones such that 



E 



^ r)(n) 2 < 



This last condition is often required in the proofs of theorem about the convergence of Kohonen 
algorithm, but we have checked in our simulation that there is no convergence. For example if we 
use n(n) = — the limit values of weights are not ordered at the end of the learning process for any 
initial condition (that is for any random choice of weights at the beginning of the algorithm). This 
result contradicts that one of Sadeghi (J23f). In his paper he made a numerical check but it is not 
enough since he has proven directly only the convergence in mean and not the a.e. convergence 
and in addition in his simulation he started from ordered weights. 

Remark 2. Although the theorem is formulated in the multi- dimensional case we use it in one 



dimension because the condition (2.3.13) is not easy to check in the general case. For M = 1 it 
has been seen in the paper (1251) that, if the distribution of the data is uniform and the data belong 
to the interval (0, 1), the clusters are intervals of amplitude 0.1 for N = 10. They are centered 
around the points (0.5,1.5, ). If the data are gaussian distributed, as in the biological case, 



there is no unique point satisfying condition (2.3.13) and other arguments must be used. We show 



in Section 3 that, choosing rj{n) in a particular way, it is still possible to have a.e. convergence 
but there is no theorem justifying this result. 

Proof 

The proof goes like in the paper [25]. Let B be the point w of the theorem, U e (B) be the e 
spherical neighborhood of u>, r(e) the first n for which the process cu(n) enters in U £ (B). Let a n 
be the stopping time 

a n = t(e) An = min(n, t(s)). 

The function W(n,a n ) of the theorem of Ncvel'son and Has'minskii for the case of the Kohonen 
algorithm is 

N 

W{n,uj{n)) = ^||c^(cr„)-^|| 2 

i=l 



In effect the condition (2.3.6) on the function W(n,Lu(a n )) is nothing other than the non negative 
super martingale condition, so if it is possible to show this condition it is possible to apply the 
convergence property of martingales. 
So we start proving: 



N 



Y^M\">i(n + 1) - ^;|| 2 |^ n ) - m(n) - wj 2 ] < (2.3.15) 
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The details of the proof can be found in (|25j) here we give the main results 
LW(n,u(n)) = E( W(n+l,w(n+l)) - W(n,cj(7i))|.F„ ) = 

JV 

= 2^E(||wi(o-„ + i) - w,|| 2 |.F„) - \\uji{a n ) - ^ll 2 ] = 



= y2.ri{n){u)i{a n ) - u>i) ■ V] / A(«, - Wi(a n ))f(x)dx + r} 2 (n)gi(uj(a n )) (2.3.16) 

i= i „ ^n( w (a„)), 



Where: 



gi(w(cr„)=^^ / A(v,i)\\x - uj l (a n )\\ 2 f(x)dx 

1=1 v Jn(u,(v n )) v 



with Lo{n) = (ui(n),u 2 (ri), ...,w N (ri))) 
But 

0l(w((T n ) < 

JV 

^ EE / |A(m)II*-^K)II 2 /(*)I^ 

^ EE / ^/(^ 

= iV-a-A = 2 

since |A(w,i)| < a, a > 0, and where A is a positive constant such that : 

max{||x — oj^CTn)!! 2 } < A 



(2.3.17) 



so for (2. 3.171, and the conditions (2.3.131 and (2.3.141 we obtain: 

Um V 2 (n) giKgn)) = Q 
n^+oo r){n) g(u{cr n )) 



(2.3.18) 



Hence, for n large enough, the sign of the term (2.3.16) is determined by the sign of g{uj{a n )) and 
so we have: 

JV 

^[E(|K(a„ +1 ) - u>i\\ 2 \F n ) - |KK) - ^|| 2 ] < 
i=i 

From this inequality it follows that 

JV 

W(n,u)(a n )) = IKOn) - Will 2 
i=i 

is a non negative super-martingale . 

Since W(n,u(a n )) is a non negative super-martingale, from the theorem about martingale the 
limit exists almost everywhere, in addition, by the definition of the stopping time a n and assuming 
that r(e) < C(e) we have that 

3 C > such that lim W(n,u(a n )) — C a.e. 

n — >+oo 

Hence we found the main inequality of the theorem of Nevel'son and Has'minskii : 

LW(n,u(n)) <r)(n)g(uj(n)) (2.3.19) 



From (2.3.131 we get that (2.3.51 holds 



inf 

uj(n)eV e , R (B) 



-g(u(n))] > 0. 



(2.3.20) 
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In addition 



inf W(n,u>(n)) — ► +00 per|w(n)| — > +00 



(2.3.21) 



Thus we can apply the theorem of Nevel'son and Has'minskii, where Q = Q(e, R), ( R > e > 0) 
is some constant, e is a small enough parameter, B £ = {x e M A ' xM such that 1 1 a; — w|| < e} is a 
spherical neighborhood of the limit point, B — f] s>0 B £ , U e (B) — {uj : p(u>. B) < e} 
V e (B) = R N * M \U £ (B), V £>R {B) = V £ (B) n {(w(n) : |w(n)| < i?} 



Considering the above statements ( 2.3. 19| 2.3.20| 2.3.21| and 2.3.141 we note that the hypoth- 
esis of theorem I2JJ are satisfied and so we obtain 



P{liminf p(oj(n),B) = 0} = 1 

n — *+oo 



where p(uj(n),B) = inf yeB p(u:(n),y) 



(2.3.22) 



Now by (2.3.22) we have that when n — > +00 ui{n) — > Cu with probability 1. 
In fact, since 



we have 



Thus we get: 



lim [sup W(n, w(n))] = 

w(n)-»0) „>o 



P{ lim W(n,cu(n)) = 0} = 1 

n — >+oo 



lim u>i(n) = LUi i = 1, N a.e. 

n — >+oo 



In addition as it has been proven in (|25j). the algorithm will achieve the given accuracy e within 
a finite number of updates, that is r(e) < C(e). 



3 Numerical studies 

In this section we illustrate our numerical simulations about the convergence of the Kohonen 
algorithm. First we consider a uniformly distributed data set, then a normal distributed data set, 
all the data are one dimensional as we already said. 

We see that the algorithm does not even converge in mean ( and so also not a.e.) if: 
1. rj(n), the learning parameter, decreases too fast 



2. The neighborhood functions ( A(i, v) (2.3 ) , or h(i, v, n) (2.3.4) ) have a range of action too 
small or too large. 

In addition, although the learning parameter and the neighborhood function are optimally chosen, 
the convergence of the algorithm is slow and it needs a large number of iterations in order to have 
a good accuracy. So, when the data set is not large enough, it is useful to repeat the presentation 
of data several times in random order until we have a large data set. 

In particular in the case of uniformly distributed data, chosen inside the interval [0, 1], we verify 
numerically that, having a large data set, choosing any neighborhood function and using as learn- 
ing parameter rj(n) = ^ with a > 1 the algorithm does not converge in any sense ( for different 
initial choices of weights we have different outputs) and the weights are not ordered during the 
learning procedure . Instead using 77(71) = < with a < | we have the convergence in mean. So 
the convergence property depends on the velocity of decay of 77(77). In fact if 77(71) decreases too 
fast, e.g. 77(77) = ^ , with a > 1, the updated weights change their values very little during the 
learning and so the algorithm is not able to find the final configuration of weights. 77(71) ~ l/n is 
a too fast decay because after 100 iterations already the variation of the weights is very small and 
so there is no convergence while 77(77) ~ decreases less quickly ( it assumes values less than 
0.01 from 77 > 10 6 ) and its velocity of decrease is sufficient to have the convergence. 
The choice of 77(77) is basic not only for the convergence but also for accuracy. In fact we can have 
the convergence of the algorithm though the algorithm is not able to identify all the limit weights 
but only some of them. This happens when the weights are updated too fast in the last part of 
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the learning procedure or when the range of r)(n) does not cover all the interval (0, 1), for example 
when the range of r](n) is (0.5, 1). 
We analyzed the following 77(71) : 

1. 77(77) = . l = 

2 - »?(») = Ttgjnj 

3. 77(77) = 77i(2i)^ 
4- V(n) = 4= 



(where 77, and 77/ are respectively the initial and final value of the function 77 and n max the 
maximum number of iterations). For all these cases we have convergence in mean, but for each 
case there is a different accuracy . 
Choosing 



1. 77(77) 

2. 77(77) 



r/i(l 







^6*log(n) 

we have convergence a.e. The values of the constants and the particular forms of the functions 
77(77) have been determined for satisfying the constraint < 77(77) < 1. 

Before explaining the reasons of this statement, we want to discuss the connection of the con- 
vergence with the values of the parameters. The choices of the parameters depend on the data 
distribution. For example for the case 3), in the case of uniformly and normally distributed data, 
generally we have convergence if we choose 77^ between 0.1 and 0.9 and 77/ between 10 -6 and 0.1; 
in the case 1) of the second list the range of 77, is (0.1,0.9). Instead for example with log-normal 
distributed data the range of 77, is (0.4, 0.9) and of 77/ is (0.01, 0.1) in the case 3) and in the other 
case the range of r\i is (0.4,0.8) . 

After many simulations we saw that there is convergence in mean for 77(77) such that: 



1 / \ \/6 * loq(n) 



instead there is a.e. convergence for 77(77) such that 



log(n) 



< r]{n) <r)i(l- 



:) 



nmax 



The convergence depends also on the values of parameters concerning the neighborhood function, 



o.a - 
0.7* 



2000 4000 eooo sooo 1 0000 




2000 4000 sooo sooo 1 0000 



Figure 3: On the left we have 77(71) 



on the right 77(77) = rji{l 



that is the range of the action of the winner neuron which is determined by s in the case of 
A(i, v), and by Ui and 07 in the case of h(i, v, n). The choice of s depends strongly on the number 
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of weights we fix at the beginning and the number of iterations. For example using a data set 

of about 10000 uniformly distributed data if we choose 77(71) = v) with s = 1 

as neighborhood function and we want to find 30 groups we do not have the convergence (the 
weights are not ordered) but changing the value of s conveniently (in this case s > 2) we obtain 
the convergence. 

If the data set is smaller than 10000, s is larger than the one of the previous example. 

In the case of the h neighborhood function the best choices of <7j and 07 are the following: 



N 

- (3.0.23) 
(j f = 0.01 (3.0.24) 

where N is the number of weights. 

We have more than one choice for the parameters to obtain the convergence but different choices 
give different outputs. We illustrate some examples. Finding out 10 weights for a data set of 

10000 uniformly distributed data, using 77(71) = y/ e * l ° 9 ( n \ an( j usm g n jf we choose Oi = 20 

V(»i)+i 

and a/ = 0.01 the algorithm converges and the range of values of weights is (0.48,0.50), in this 
case the network identifies 10 different values inside that interval; instead if we choose av = 5 



and (7 f = 0.01 the range is (0.37,0.65). We see that the best solution is given by 3.0.23 3.0.24 
because we have the biggest range of the weights values, in this case is 0.19 to 0.82. It is important 
to have the range that covers all the interval of the data set because otherwise we do not find 
the optimal partition. Since we know that for any data distribution the expectation of weights 
converges, a small range indicates that the network is able to find only some of the limit values 
of weights, in fact for the reported example, with N = 10, we know from [25] that the limit values 
are: 0.05,0.15,0.25,0.35,0.45,0.55,0.65,0.75,0.85,0.95, so the range of the weights values must 
be (0.05,0.95), more or less. In the worst choice the network identifies only one limit value. It 
happens because the range of action is too big; in this case the algorithm updates simultaneously 
too many weights and they converge to the same value. 

An analogous situation happens using A(z, v) as a neighborhood function. Using 77(71) = V 6 *'°g(") ^ 

V(n) + 1 

searching always 10 weights for a data set of 10000 uniformly distributed data, if we choose s > 1 
the algorithm converges but the range of weights values change for different choices of s. Increasing 
s the range of weights becomes smaller and the weights converge to the same limit if s = 10. To 
be more precise if s is equal to N, the network generates N weights ( in this case N = 10) with 
the same value. The biggest range, in this case, is obtained with s = 1. If the number of weights 
increases the best choice of s is always the minimum values of s by which we obtain the convergence 
of the algorithm. For example in the case we search 50 weights the best choice is s = 3. 
Summarizing to obtain the convergence we must choose 77(71) with a convenient monotone decay 
and with a large range; in addition we must estimate the right parameters of the neighborhood 
function such that we have convergence and the maximum range for the weights values in order 
to determine the optimal partition of data set. 

As we said previously the error of the expectation of the weights varies for different choices of 
77(71), and for some choices of 77(71) we have a.e convergence. 

This statement is based on the following analysis: we run the Kohonen algorithm 1000 times for 
different data sequences. We use at the beginning a set of uniformly distributed data of 4000 
elements , then 10000,20000,30000,60000,120000,150000 and 250000. This procedure has been 
done with all the mentioned 77(71) and both A and h. 

At the end of algorithm running for each data set we have 1000 cases of weights limit values. The 
mean value of these cases actually converges to the centers of the optimal partition of the interval 
(0, 1) for all 77(71) and for each neighborhood function. 

In addition the average error of limit weights, with respect to the exact values of the centers, 
decreases on increasing the number of iterations for 77(71) = V 6 * l °9( n ) anc ] _ — ^— ) and any 
neighborhood function; but using A(i, v) the error decreases more quickly. 

Moreover the computing time of the algorithm using h is about 7 times longer than the one using 
A and the accuracy of weights on the boundaries is worse using h. 
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The weights near the border are not updated symmetrically and so they are shifted inward by an 
amount of the order of ^hyi where N is the number of weights in the case of A(i, v) while, using 
h, the weights which are shifted are 4, two for each boundary. 

Now we illustrate the quoted results. The following tables ( 1|2|3|4 ) show the evolution of the error. 
In the first table there are the average errors of each weight using A as neighborhood function 
and 10000 uniformly distributed inputs; instead in the third table there are 60000 uniformly 
distributed inputs. In the second and in the fourth table it is shown the case of h as neighborhood 
function. The weights are TV = 10, so the limit values , which are written in the firs column of 
every tables, are: 0.05,0.15,0.25,0.35,0.45,0.55,0.65,0.75,0.85,0.95 



log(n) 



%(1 — — ) 



sfn+1 



0.05 

0.15 
0.25 
0.35 
0.45 
0.55 
0.65 
0.75 
0.85 
0.95 



0.0561 
0.0189 
0.0239 
0.0185 
0.0202 
0.0189 
0.0198 
0.0238 
0.0180 
0.0555 



0.0562 
0.0197 
0.0246 
0.0191 
0.0211 
0.0196 
0.0208 
0.0243 
0.0188 
0.0556 



0.0576 
0.0358 
0.0396 
0.0364 
0.0385 
0.0366 
0.0389 
0.0382 
0.0351 
0.0568 



0.0559 
0.0088 
0.0183 
0.0104 
0.0104 
0.0107 
0.0111 
0.0185 
0.0089 
0.0558 



0.0559 
0.0160 
0.0216 
0.0158 
0.0168 
0.0161 
0.0167 
0.0216 
0.0158 
0.0557 



Table 1: Mean error of each weight for 10000 iterations using A as neighborhood function. The 77 are 
shown in the first row. 







i 


i 


%(1 " ) 




log(n) 




'* V nmax * 




0.05 


0.1063 


0.1061 


0.1045 


0.1059 


0.1061 


0.15 


0.0565 


0.0564 


0.0587 


0.0557 


0.0562 


0.25 


0.0339 


0.0342 


0.0440 


0.0305 


0.0325 


0.35 


0.0262 


0.0267 


0.0409 


0.0197 


0.0238 


0.45 


0.0214 


0.0222 


0.0388 


0.0109 


0.0183 


0.55 


0.0213 


0.0223 


0.0379 


0.0106 


0.0185 


0.65 


0.0256 


0.0264 


0.0403 


0.0189 


0.0232 


0.75 


0.0336 


0.0343 


0.0460 


0.0302 


0.0322 


0.85 


0.0567 


0.0570 


0.0615 


0.0560 


0.0566 


0.95 


0.1067 


0.1069 


0.1065 


0.1064 


0.1068 



Table 2: Mean error of each weight for 10000 iterations using h as neighborhood function. The 77 are 
shown in the first row. 



We give some examples to illustrate the error evolution using 77(71) = 7^(1 — ^ ) and 77(71) = 

\/&*log(n) 1 . r 

v r- , , — , the case oi a.e. convergence. 

The tables [5] and [6] concern the application of the algorithm with 4000, 10000, 20000, 30000, 
60000, 120000, 150000, 250000 iterations, which are written in the first column, and using A as 
neighborhood function As seen in the tables the error decreases faster using 77(71) = 77^(1 — n ) 
and it decreases increasing the iterations; see the figures [4|5|6| and figures [7j [8] [9] In some of these 
pictures there are the histograms of the limit weights values obtained running the algorithm 1000 
times for different numbers M of iterations using every time a specific 77(77). The histograms show 
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55 


0.0200 


0.0189 


0.0384 


0.0071 


0.0114 





65 


0.0188 


0.0178 


0.0350 


0.0079 


0.0115 





75 


0.0218 


0.0211 


0.0343 


0.0179 


0.0183 





85 


0.0181 


0.0170 


0.0336 


0.0070 


0.0108 





95 


0.0556 


0.0557 


0.0553 


0.0560 


0.0556 



Table 3: Mean error of each weight for 60000 iterations using A as neighborhood function. The rj are 
shown in the first row. 



log(n) 



\/log(n) 



—) 



\j 6*log(n) 



0.05 


0.1061 


0.1060 


0.1040 


0.1064 


0.1059 


0.15 


0.0559 


0.0558 


0.0580 


0.0560 


0.0555 


0.25 


0.0326 


0.0321 


0.0362 


0.0302 


0.0300 


0.35 


0.0243 


0.0235 


0.0340 


0.0185 


0.0193 


0.45 


0.0206 


0.0196 


0.0361 


0.0082 


0.0125 


0.55 


0.0215 


0.0205 


0.0373 


0.0088 


0.0132 


0.65 


0.0252 


0.0244 


0.0356 


0.0191 


0.0198 


0.75 


0.0321 


0.0317 


0.0345 


0.0306 


0.0299 


0.85 


0.0548 


0.0548 


0.0565 


0.0562 


0.0555 


0.95 


0.1051 


0.1053 


0.1045 


0.1064 


0.1061 



Table 4: Mean error of each weight for 60000 iterations using h as neighborhood function. The 77 are 
shown in the first row. 
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0.0562 


0.0108 


0.0183 


0.0132 


0.0135 


0.0144 


0.0139 


0.0192 


0.0110 


0.0559 
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0.0559 
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0.0558 
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0.0557 
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0.0071 
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0.0179 


0.0070 


0.0560 


120000 


0.0559 


0.0065 


0.0176 


0.0072 


0.0060 


0.0064 


0.0072 


0.0180 


0.0067 


0.0560 


150000 


0.0560 


0.0065 


0.0180 


0.0070 


0.0058 


0.0060 


0.0071 


0.0178 


0.0065 


0.0560 


250000 


0.0560 


0.0062 


0.0178 


0.0067 


0.0050 


0.0054 


0.0069 


0.0180 


0.0064 


0.0562 



Table 5: Evolution of the mean error for each weight in the case of 77 (n) = 77^(1 — ).erri means: 

mean error of weight i 



how, increasing M, only in the case of 77(71) = 77,(1 — nr ^ ax ) and rj(n) — ^7=^^ the 
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30000 


0.0558 


0.0128 


0.0190 


0.0135 
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Table 6: Evolution of the mean error for each weight in the case of 77(71) = "^j=^f^ 
error of weight i 



.err.; means: mean 



of the histograms tend to 0, each around a limit value of the weight, as we expect since we have 
a.e. convergence. There are only the histograms for 77(71) = as example of the convergence 

in mean since the other cases are similar. 

The velocity of convergence is very slow after 100000 iterations, it needs many iterations only to 
change one weight nearer to its limit value; so to construct the histograms with ten columns we 
need a huge data set. In addition in the pictures with the plot of medium error of each weight 
we can see that the error decreases always increasing the iterations only in the case of those 77(71) 
which assure the a.e. convergence. Focusing the attention on the pictures of the error we see that 
from 150000 iterations in the case of 77(77.) = 7/^(1 — " ) the error decreases very slowly and for 
the third and eight weight there is a little increase of error ~ 0.002, it depends on the propagation 
of the error of the weights from the boundaries. 
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Figure 4: Histograms in the case of 77(77) = r\i (1 
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The a.e. convergence is guaranteed in the case of 77(77) = 77^(1 



: ) and 77(77) 



by the monotonically decrease of standard deviation of weights as the figures U0\ and 1 1 show 
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Figure 5: Histograms in the case of 77(71) = ~7=^p^ , uniformly distributed data and for different 
numbers M of iterations 
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Figure 6: Histograms in the case of 77(71) 
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Similar results are obtained with the normally distributed data but for a special choice of the 
learning parameter. In fact our theorem does not hold for gaussian distribution as we already 
mentioned. 

Also in this case we have convergence in mean with all the 77 and for any neighborhood function; 
and convergence a.e for 77(71) = 77,(1 - ^^) and 77(71) = . 

We show in the following histograms (figures 12|13 1 and plots (figures 14|15 1 only the results for 
a.e convergence: 
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Figure 7: Plot of the mean error for each limit weight in the case of rj{n) = 77^(1 — nj ^ ax ) and uniformly 
distributed data. The numbers on the x axes indicate the following iterations: 4000, 10000, 20000, 



30000, 60000, 120000, 150000, 250000. 
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Figure 8: Plot of the medium error of each limit weight in the case of rj(n) = ^ & *n°+^ ~ an d uniformly 
distributed data. The numbers on the x axes indicate the following iterations: 4000, 10000, 20000, 
30000, 60000, 120000, 150000, 250000. 



Before explaining our application of Kohonen algorithm to microarrays data, we make some 
remarks on the repetitions of data set presented to the network. This procedure is necessary 
because the data set is small for microarray data. The accuracy improves by increasing the 
number of samples and this technique does not change the limit if there is the almost everywhere 
convergence. 

To be sure we have done the same analysis shown above with a data set of 2000 elements repeated 
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Figure 9: Plot of the medium error of each limit weight in the case of 77(77) = ; og („) and uniformly 
distributed data. The numbers on the x axes indicate the following iterations: 4000, 10000, 20000, 
30000, 60000, 120000, 150000, 250000. 
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Figure 10: The standard deviation of the weights in the case of rj(n) = ^(1 — n ™ ax ) and uniformly 
distributed data. The numbers on the x axes indicate the following iterations: 4000, 10000, 20000, 
30000, 60000, 120000, 150000, 250000. 



at the beginning 2 times, then 5, 10,15,30,60 and 125 so to have the same iterations of the previous 
analysis. The results are similar, that is we have a.e. convergence for the previous case of 77(71), 
that is : 

• n{n) = 77,(1 5—) 

/ \ \/G*loq(n) 

• 77(71) = V r yy ' 
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Figure 11: The standard deviation of the weights in the case of 77(77.) = ^y-^-f" ' and uniformly 
distributed data. The numbers on the x axes indicate the following iterations: 4000, 10000, 20000, 
30000, 60000, 120000, 150000, 250000. 
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Figure 12: Histograms in the case of 77(77) = 
numbers M of iterations 



; ) and normal distributed data and for different 



We show the histograms [16] [T7|in these two case to illustrate this statement: 

4 Application to microarrays data 

In this section we show how we have applied the Kohonen network to micro-arrays data set. Fol- 
lowing our strategy we have made the cluster analysis of the data for each sample and compared 
the genes appearing in the nearby clusters, in this way we exploit the neat convergence properties 
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Figure 14: The standard deviation of the weights in the case of rj(n) = 77,(1 — " ) and normal 
distributed data.The numbers on the x axes indicate the following iterations: 4000, 10000, 30000, 60000, 
90000, 150000, 250000, 350000. 



of the one-dimensional case. The set of data we analyzed is the same as the one published in 
|30j where there is an exhaustive description of microarrays sample preparation. In brief total 
RNA (ttlRNA) was extracted and purified from mammary glands in control and transgenic mice. 
ttlRNA were pooled to obtain three replicates for the mammary glands of 2-week-pregnant WT 
BALB/c mice (wk2prg), of 22 week old untreated BALB-neuT mice(wk22nt), and of 22 week old 
primed and boosted BALB-neuT mice(wk22pb)and two replicates for the mammary glands of 10 
week old untreated BALB-neuT mice ( wklOnt). Chips were scanned to generate digitized image 
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Figure 15: The standard deviation of the weights in the case of 77(71) = ^M+l — an d normal distributed 
data.Thc numbers on the x axes indicate the following iterations: 4000, 10000, 30000, 60000, 90000, 
150000, 250000, 350000. 




Figure 16: Histograms in the case of 77(71) = 77^(1 — ^ ) and uniformly distributed data and for 
different numbers M of iterations obtained repeating the data several times. 



data (DAT) files. 

DAT files were analyzed by MAS 5.0 to generate background- normalized image data (CEL files). 
Probe set intensities were obtained by means of the robust multi array analysis method ([!]). The 
full data set was normalized according to the invariant set method. The full-shaped procedure 
described by Saviozzi et al ([29 ) was then applied. The resulting 5482 probe sets were analyzed 
by combining two statistical approaches implemented in significance analysis of micro-arrays ( 3 ): 
two classes unpaired sample method and the multi classes response test. This analysis produced 
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Figure 17: Histograms in the case of rj(n) = v ^'° g ^" - and uniformly distributed data and for different 
numbers M of iterations obtained repeating the data several times. 



a total of 2179 probe sets differentially expressed in at least one of the three experimental groups. 
The 2179 probe sets were converted in virtual two dye experiments comparing all replicates of 
each experimental groups with index j = 1, ..3 ( i.e wkl0nt > % = 1,2; ■ ^22 P b, • = 1 3 s 

1 ox j j \ wkzprgj ' ' wkZprgj ' wklprgj 1 ' 

Therefore we have 3 replicates of 8 experimental groups. 

We apply the Kohonen algorithm to the 2179 probe sets. Our main aim is to detect which genes 
are up-modulated in wk22pb respect to wk22nt and wklOnt. 

The first step is to implement the one dimensional Kohonen algorithm in Matlab and study its 
convergence putting inside as inputs data the 2179 expression levels of genes of any experimen- 
tal group. In particular our set contains the log values of expression levels of genes, which are 
normally distributed in any experimental group; so, with regard the numerical studies done, we 
choose rji(l — 71 ), with rji = 0.8 because we have the almost everywhere convergence, and A as 
neighborhood function, since we have the best accuracy in this case. 

For each experimental group the input set f2 is only of 2179 elements so to improve the accuracy 
we repeat the data presentation set several times in different order. We present the data set 100 
times, in such way the input set is almost 220000 patterns; in this case the mean error of weights 
is about 0.01 (as we have seen in our previous studies) and since the average variability of the 
expression levels of genes among the replicates is about 0.133, this error is acceptable. 
We run the Kohonen algorithm fixing N , the number of weights, equal to 30 and then we choose 
among the limit values found only those weights with a distance greater than two times the average 
variability of the expression levels of genes among the replicates. We select the weights in this way 
because otherwise the assignment of a gene to a particular cluster could be not unique. The choice 
of ./V = 30 has been done analyzing the distribution of the data and considering the variability 
of the expression levels of genes among the replicates. To obtain weights with a distance greater 
than two times the average variability of genes expressions we can fix also N = 15, but in this way 
we lose precision in finding weights at the boundaries of the data set interval. It happens because 
the data are normally distributed, therefore they are concentrated near the mean of the data set 
and more we move away from mean more the distance between weights increases, therefore it is 
better choosing more weights than those which have an optimal distance between them, such that 
it is possible to detect more weights at boundaries, since we want to find out up-modulated genes. 
Once we have found the limit weights values we separate the data into the identified clusters. 
This procedure has been done for every experimental group indexed by j ( Zklvrn 1 * = li 2; 
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wk^prg* ' wk2prg- * = s0 we nave eight classifications for each replicate. In addition we 

choose one of the 24 (8 for each replicate) sequences of limit weights and we separate the data of 
every experimental into the clusters identified by the chosen sequence. In this way we obtained 
24 classifications for every sequence of limit weights (that are 24). 

Once we have obtained these classifications we improve the precision of assignment of genes con- 
sidering their biological variability; therefore we have checked if the expression level of genes which 
lay on the boundaries of a cluster can be considered really belonging to that cluster or, because its 
variability to its neighbor. In detail, if the expression of the genes, incremented of its biological 
error, is closer to the weight of its cluster than to its nearest weight, the assignment of the gene 
does not change, otherwise the gene is assigned to the cluster corresponding to its nearest weight. 
We can observe that, since the limit weights are ordered, the clusters, with which they are asso- 
ciated can be ordered in ascending way. Therefore in clusters related to high index we find genes 
with a greater expression level than in clusters with low index. 

For each replicates we select only those genes that are in clusters with high index for the classi- 
fications obtained respect the limit weights found analyzing the data of ^^prg- * = 1j -3 an d in 
low clusters for classifications obtained by means of the limit weights found analyzing the data of 
ii)fc2p"g J ' wk^prg 1 ' J = 1,2. After this procedure we have identified a set of 70 up-modulated genes 
in wk22pb respect to wk22nt and wklOnt. Among these genes there are 25 ones that have not 
been found by Quaglino et al. These new genes found are shown in the table [7] 



5 Conclusion 

We have improved the theorem on the a.e. convergence of the Kohonen algorithm because we 
prove the sufficiency of a slow decay of the learning parameter, ^ rj{n) = oo, similar to the one 
used in the applications. The theorem is not complete because we are not able to prove the 
necessity of such condition and future work should be concentrated on this point. But for doing 
such a research one has to find something functional more similar to the Liapunov functional 
than the one currently available. This could make it possible using some argument of convergence 
similar to the one used for the simulated annealing. We made also many numerical simulations of 
convergence in order to find the choice of 77(77.) which minimizes the rate of decrease of the average 
error and also for finding which version of the learning algorithm is better to use. We found that 
the optimal choice is: 

y/Q*log{n) .... 71 n 

-j= < ri(n) < rjjiil ) 

^/(n) + 1 ' nmax 

The algorithm with A neighborhood function is better than the one using the function h since 
it has bad convergence properties. The latter one is commonly used in the simulations. After this 
detailed analysis we applied our optimal choice to the genetic expression levels of tumor cells. The 
25 genes identified by us were also consistent with the biological events investigated by Quaglino 

my 
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Table 7: The up modulated genes found out 
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